Add cdp support for xpath scrapers #625
Conversation
Another example of a CDP-only scraper:

```yaml
name: Bang
sceneByURL:
  - action: scrapeXPath
    url:
      - https://www.bang.com/video
    scraper: sceneScraper
xPathScrapers:
  sceneScraper:
    scene:
      Title: //meta[@name="og:title"]/@content
      Details: //meta[@name="description"]/@content
      Image: //meta[@name="og:image"]/@content
      Date:
        selector: //span[@class="hidden-xs fa-text-right fa-text-left"]/text()
        replace:
          - regex: (\w+\s)(\d+)\w+,(\s\d{4})
            with: $1$2$3
        parseDate: Jan 2 2006
      Tags:
        Name:
          selector: //div[@class="genres bottom-buffer10"]/a
      Performers:
        Name:
          selector: //div/span[@class="fa-text-right" and contains(text(),"With:")]/../span[@class="comma-list-container"]/span/a/text()
      Studio:
        Name: //div/span[@class="fa-text-right" and contains(text(),"Studio:")]/a/text()
debug:
  printHTML: true
driver:
  useCDP: true
  remote: true
  sleep: 5
```
```yaml
driver:
  useCDP: false
  remote: false
  clearCookies: false
```
I don't think allowing scraper configs to wipe Chrome cookies is safe or reasonable. Further, because it is in the config, there is no way of tailoring it to docker and non-docker instances. I think another way is needed to achieve this. I'm not very clear on the protocols, but maybe this might help?
The clearCookies option was supposed to be for users that write scrapers, not the end users. I needed to delete some cookies while testing a CDP scraper and didn't know how to. I agree, as stated above, that it isn't recommended though. I'll have a look at what you reference; if it creates something like an incognito mode I'll implement it in a later PR.

EDIT: merged with upstream code. Removed clearCookies code. Basic functionality should be complete.
I'm not sure that the
Yes, I was also reluctant about this. I was either going to put it in there or a scraper option beneath the
Tried the following:
FWIW chrome isn't named
That's weird. If there is an error locating a Chrome instance you should get a red GraphQL message like this in the UI. I don't know how CDP locates the Chrome binary in Windows, but since you got some results I am fairly certain it got detected and used. Maybe having a look at the process manager in Windows while running a stash CDP query can verify that. (You can increase the sleep option to 5 to force it to wait a little more.) Running against the following URLs:
I get everything scraped as expected from Pure Taboo.
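The sleep increase suggested above goes in the scraper's `driver` section. A minimal sketch (the value 5 is the illustrative figure from the comment, not a recommended default):

```yaml
driver:
  useCDP: true
  # wait 5 seconds for the page's scripts to finish before reading the DOM
  sleep: 5
```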
My test system already had a puretaboo scraper (under a different filename) which was interfering with this test. I can confirm that the scraper appears to give the correct results when I add the chrome directory to the PATH. I haven't tried the remote configuration. I think the chrome configuration needs to be put into the general configuration. A free text field that accepts either a remote chrome instance URL, or a path to the chrome executable. This way the behaviour is clear and deterministic.
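One possible shape for that free text field, sketched as YAML; the key name `chromeCDPPath` and both values are hypothetical, for illustration only:

```yaml
# hypothetical general config entry: either a remote CDP instance URL...
chromeCDPPath: http://127.0.0.1:9222
# ...or a path to a local Chrome executable, e.g.:
# chromeCDPPath: C:\Program Files\Google\Chrome\Application\chrome.exe
```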
Refactor CDP stuff
@WithoutPants your PR changes work fine using the
Co-authored-by: WithoutPants <53250216+WithoutPants@users.noreply.github.com>
Adds an extra `driver` option to the scraper config. Using `useCDP: true` makes loadURL use Chrome CDP to get the URL and thus partially parse it. Setting `useCDP` to false, or not having the `driver` option at all, means the plain HTTP client is used instead.

If `remote` is set to true then CDP looks for a remote instance of Chrome at the address 127.0.0.1:9222 that is compatible with the headless Chrome docker image https://hub.docker.com/r/chromedp/headless-shell/. If it is unset or set to false, it defaults to looking for google-chrome in the $PATH.

`sleep` is the time in seconds to wait before actually getting the page from the DOM. This is needed as some sites (bang.com for example) need more time for loading scripts to finish. If unset it defaults to 2 seconds.

Having this option allows us to create scrapers for sites that pull data from JS.
The following puretaboo scraper is possible (performers and tags can't be retrieved without CDP).
Use of the `debug` option is advised so that you can look at the actual page text that is returned.